Abstract



To understand a speaker's turn of a conversation, one needs to segment
it into intonational phrases, clean up any speech repairs that might
have occurred, and identify discourse markers. In this paper, we argue
that these problems must be resolved together, and that they must be
resolved early in the processing stream. We put forward a statistical
language model that resolves these problems, does POS tagging, and can
be used as the language model of a speech recognizer. We find that by
accounting for the interactions between these tasks, the performance
on each task improves.

In Proceedings of ACL/EACL'97

1 Introduction

Interactive spoken dialog provides many new challenges for natural
language understanding systems. One of the most critical challenges is
simply determining the speaker's intended utterances: both segmenting
the speaker's turn into utterances and determining the intended words
in each utterance. Since there is no well-agreed-upon definition of
what an utterance is, we instead focus on intonational phrases
(Silverman et al., 1992), which end with an acoustically signaled
boundary tone. Even assuming perfect word recognition, the problem of
determining the intended words is complicated by the occurrence of
speech repairs, which occur where the speaker goes back and changes
(or repeats) something she just said. The words that are replaced or
repeated are no longer part of the intended utterance, and so need to
be identified. The following example, from the Trains corpus (Heeman
and Allen, 1995), shows a speech repair with the words that the
speaker intends to be replaced marked as the reparandum, the words
that are the intended replacement marked as the alteration, and the
cue phrases and filled pauses that tend to occur in between marked as
the editing term.

Example 1 (d92a-5.2 utt34)

we'll pick up  a tank of    uh            the tanker of  oranges
               reparandum ^ editing term  alteration
                          interruption point

Much work has been done on both detecting boundary tones (e.g. (Wang
and Hirschberg, 1992; Wightman and Ostendorf, 1994; Stolcke and
Shriberg, 1996a; Kompe et al., 1994; Mast et al., 1996)) and on speech
repairs (e.g. ([...]; Stolcke and Shriberg, 1996b)). This work has
focused on one of the issues in isolation of the other. However, these
two issues are intertwined. Cues such as the presence of silence,
final syllable lengthening, and presence of filled pauses tend to mark
both events. Even the presence of word correspondences, a traditional
cue for detecting and correcting speech repairs, sometimes marks
boundary tones as well, as illustrated by the following example where
the intonational phrase boundary is marked with the ToBI symbol %.

Example 2 (d93-83.3 utt73)

that's all you need % you only need one boxcar

Intonational phrases and speech repairs also interact with the
identification of discourse markers. Discourse markers (Schiffrin,
1987; Hirschberg and Litman, 1993; Byron and Heeman, 1997) are used to
relate new speech to the current discourse state. Lexical items that
can function as discourse markers, such as "well" and "okay," are
ambiguous as to whether they are being used as discourse markers or
not. The complication is that discourse markers tend to be used to
introduce a new utterance, or can be an utterance all to themselves
(such as the acknowledgment "okay" or "alright"), or can be used as
part of the editing term of a speech repair, or to begin the
alteration. Hence, the problem of identifying discourse markers also
needs to be addressed with the segmentation and speech repair
problems.

These three phenomena of spoken dialog, however, cannot be resolved
without recourse to syntactic information. Speech repairs, for
example, are often signaled by syntactic anomalies. Furthermore, in
order to determine the extent of the reparandum, one needs to take
into account the parallel structure that typically exists between the
reparandum and alteration, which relies on identifying the syntactic
[...] involved (Bear, Dowding, and Shriberg, 1992; Heeman and Allen,
1994). However, speech repairs disrupt the context that is needed to
determine the POS tags (Hindle, 1983). Hence, speech repairs, as well
as boundary tones and discourse markers, must be resolved during
syntactic disambiguation.

Of course when dealing with spoken dialogue, one 
cannot forget the initial problem of determining the 
actual words that the speaker is saying. Speech rec- 
ognizers rely on being able to predict the probabil- 
ity of what word will be said next. Just as intona- 
tional phrases and speech repairs disrupt the local 
context that is needed for syntactic disambiguation, 
the same holds for predicting what word will come 
next. If a speech repair or intonational phrase oc- 
curs, this will alter the probability estimate. But 
more importantly, speech repairs and intonational 
phrases have acoustic correlates such as the pres- 
ence of silence. Current speech recognition language 
models cannot account for the presence of silence, 
and tend to simply ignore it. By modeling speech re- 
pairs and intonational boundaries, we can take into 
account the acoustic correlates and hence use more 
of the available information. 

From the above discussion, it is clear that we need 
to model these dialogue phenomena together and 
very early on in the speech processing stream, in 
fact, during speech recognition. Currently, the ap- 
proaches that work best in speech recognition are 
statistical approaches that are able to assign proba- 
bility estimates for what word will occur next given 
the previous words. Hence, in this paper, we in- 
troduce a statistical language model that can de- 
tect speech repairs, boundary tones, and discourse 
markers, and can assign POS tags, and can use this 
information to better predict what word will occur 
next. 

In the rest of the paper, we first introduce the Trains corpus. We
then introduce a statistical language model that incorporates POS
tagging and the identification of discourse markers. We then augment
this model with speech repair detection and correction and
intonational boundary tone detection. We then present the results of
this model on the Trains corpus and show that it can better account
for these discourse events than can be achieved by modeling them
individually. We also show that by modeling these two phenomena, we
can reduce our POS tagging error rate by 8.6% and improve our ability
to predict the next word.

Dialogs                          [...]
Speakers                            34
Words                            58298
Turns                             6163
Discourse Markers                 8278
Boundary Tones                   10947
Turn-Internal Boundary Tones      5535
Abridged Repairs                   423
Modification Repairs              1302
Fresh Starts                       671
Editing Terms                     1128

Table 1: Frequency of Tones, Repairs and Editing Terms in the Trains
Corpus

2 Trains Corpus 



As part of the Trains project (Allen et al., 1995), which is a long
term research project to build a conversationally proficient planning
assistant, we have collected a corpus of problem solving dialogs
(Heeman and Allen, 1995). The dialogs involve two human participants,
one who is playing the role of a user and has a certain task to
accomplish, and another who is playing the role of the system by
acting as a planning assistant. The collection methodology was
designed to make the setting as close to human-computer interaction as
possible, but was not a wizard scenario, where one person pretends to
be a computer. Rather, the user knows that he is talking to another
person.

The Trains corpus consists of about six and a half hours of speech.
Table 1 gives some general statistics about the corpus, including the
number of dialogs, speakers, words, speaker turns, and occurrences of
discourse markers, boundary tones and speech repairs.

The speech repairs in the Trains corpus have been hand-annotated. We
have divided the repairs into three types: fresh starts, modification
repairs, and abridged repairs.^1 A fresh start is where the speaker
abandons the current utterance and starts again, where the abandonment
seems acoustically signaled.

^1 This classification is similar to that of Hindle (1983) and Levelt
(1983).

Example 3 (d93-12.1 utt30)

so it'll take   um            so you want to do what
reparandum    ^ editing term  alteration
              interruption point

The second type of repairs are the modification repairs. These include
all other repairs in which the reparandum is not empty.

Example 4 (d92a-1.3 utt65)

so that  will total    will take  seven hours to do that
         reparandum ^  alteration
                    interruption point

The third type of repairs are the abridged repairs, which consist
solely of an editing term. Note that utterance initial filled pauses
are not treated as abridged repairs.

Example 5 (d93-14.3 utt42)

we need to   um            manage to get the bananas to Dansville
           ^ editing term
           interruption point

There is typically a correspondence between the reparandum and the
alteration, and following Bear et al. (1992), we annotate this using
the labels m for word matching and r for word replacements (words of
the same syntactic category). Each pair is given a unique index. Other
words in the reparandum and alteration are annotated with an x. Also,
editing terms (filled pauses and clue words) are labeled with et, and
the interruption point with ip, which will occur before any editing
terms associated with the repair, and after a word fragment, if
present. The interruption point is also marked as to whether the
repair is a fresh start, modification repair, or abridged repair, in
which cases we use ip:can, ip:mod and ip:abr, respectively. The
example below illustrates how a repair is annotated in this scheme.

Example 6 (d93-15.2 utt42)

engine  two  from  Elmi(ra)-          or  engine  three  from  Elmira
m1      r2   m3    m4         ip:mod  et  m1      r2     m3    m4

3 A POS-Based Language Model

The goal of a speech recognizer is to find the sequence of words W
that is most probable given the acoustic signal A. However, for
detecting and correcting speech repairs, and identifying boundary
tones and discourse markers, we need to augment the model so that it
incorporates shallow statistical analysis, in the form of POS tagging.
The POS tagset, based on the Penn Treebank tagset (Marcus, Santorini,
and Marcinkiewicz, 1993), includes special tags for denoting when a
word is being used as a discourse marker. In this section, we give an
overview of our basic language model that incorporates POS tagging.
Full details can be found in (Heeman and Allen, 1997; Heeman, 1997).

To add in POS tagging, we change the goal of the speech recognition
process to find the best word and POS tags given the acoustic signal.
The derivation of the acoustic model and language model is now as
follows.

\[
\begin{aligned}
\hat{W}\hat{P} &= \arg\max_{W,P} \Pr(WP \mid A) \\
               &= \arg\max_{W,P} \frac{\Pr(A \mid WP)\,\Pr(WP)}{\Pr(A)} \\
               &= \arg\max_{W,P} \Pr(A \mid WP)\,\Pr(WP)
\end{aligned}
\]

The first term $\Pr(A \mid WP)$ is the factor due to the acoustic
model, which we can approximate by $\Pr(A \mid W)$. The second term
$\Pr(WP)$ is the factor due to the language model. We rewrite
$\Pr(WP)$ as $\Pr(W_{1,N}P_{1,N})$, where N is the number of words in
the sequence. We now rewrite the language model probability as
follows.

\[
\begin{aligned}
\Pr(W_{1,N}P_{1,N})
  &= \prod_{i=1,N} \Pr(W_i P_i \mid W_{1,i-1}P_{1,i-1}) \\
  &= \prod_{i=1,N} \Pr(W_i \mid W_{1,i-1}P_{1,i})\,
                   \Pr(P_i \mid W_{1,i-1}P_{1,i-1})
\end{aligned}
\]

We now have two probability distributions that we need to estimate,
which we do using decision trees (Breiman et al., 1984; Bahl et al.,
1989). The decision tree algorithm has the advantage that it uses
information theoretic measures to construct equivalence classes of the
context in order to cope with sparseness of data. The decision tree
algorithm starts with all of the training data in a single leaf node.
For each leaf node, it looks for the question to ask of the context
such that splitting the node into two leaf nodes results in the
biggest decrease in impurity, where the impurity measures how well
each leaf predicts the events in the node. After the tree is grown, a
heldout dataset is used to smooth the probabilities of each node with
its parent (Bahl et al., 1989).


To allow the decision tree to ask about the words and POS tags in the
context, we cluster the words and POS tags using the algorithm of
Brown et al. (1992) into a binary classification tree. This gives an
implicit binary encoding for each word and POS tag, thus allowing the
decision tree to ask about the words and POS tags using simple binary
questions, such as 'is the third bit of the POS tag encoding equal to
one?' Figure 1 shows a POS classification tree. The binary encoding
for a POS tag is determined by the sequence of top and bottom edges
that leads from the root node to the node for the POS tag.

Figure 1: POS Classification Tree
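The binary encoding described above can be illustrated with a minimal
sketch. The toy tree, the choice of "1" for a top edge and "0" for a
bottom edge, and the helper names assign_codes and bit_is_one are all
assumptions made for illustration; the actual tree is produced by
clustering and is not reproduced here.

```python
def assign_codes(tree, prefix=""):
    """Return {leaf: bitstring} for a binary tree given as nested 2-tuples.

    Assumed convention: the first (top) branch appends "1" and the second
    (bottom) branch appends "0".
    """
    if isinstance(tree, str):          # a leaf holds a POS tag
        return {tree: prefix}
    top, bottom = tree
    codes = assign_codes(top, prefix + "1")
    codes.update(assign_codes(bottom, prefix + "0"))
    return codes


def bit_is_one(code, k):
    """Answer 'is the k-th bit (1-based) of the encoding equal to one?'."""
    return len(code) >= k and code[k - 1] == "1"


# Hypothetical toy classification tree over a handful of POS tags.
toy_tree = ((("NN", "NNS"), "PRP"), (("VB", "VBP"), ("CC", "UH_FP")))
pos_codes = assign_codes(toy_tree)
```

With this toy tree, a question about the first bit separates the
noun-like tags from the rest, which is the kind of coarse-to-fine
distinction the classification tree is meant to expose to the decision
tree.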



Unlike other work (e.g. (Black et al., 1992; Magerman, 1995)), we
treat the word identities as a further
refinement of the POS tags; thus we build a word 
classification tree for each POS tag. This has the 
advantage of avoiding unnecessary data fragmenta- 
tion, since the POS tags and word identities are no 
longer separate sources of information. As well, it 
constrains the task of building the word classifica- 
tion trees since the major distinctions are captured 
by the POS classification tree. 
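The node-splitting step used when growing the decision trees in this
section can be sketched as follows. This is only an illustration under
stated assumptions: entropy is used as the impurity measure (the text
says only that information theoretic measures are used), and the
function names, toy events, and questions are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy in bits of the empirical label distribution (assumed impurity)."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def best_question(events, questions):
    """Pick the binary question giving the largest decrease in impurity.

    events: list of (context, label); questions: {name: predicate(context)}.
    Returns (name, decrease) or None if no question splits the node.
    """
    base = entropy([label for _, label in events])
    best = None
    for name, ask in questions.items():
        yes = [label for ctx, label in events if ask(ctx)]
        no = [label for ctx, label in events if not ask(ctx)]
        if not yes or not no:          # question does not split the node
            continue
        weighted = (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(events)
        decrease = base - weighted
        if best is None or decrease > best[1]:
            best = (name, decrease)
    return best


# Toy events: context is the bit encoding of the previous POS tag,
# label is the POS tag to predict (all values hypothetical).
toy_events = [("111", "VB"), ("110", "VB"), ("10", "VB"),
              ("011", "NN"), ("010", "NN"), ("001", "NN")]
toy_questions = {
    "bit1": lambda code: code[0] == "1",
    "bit2": lambda code: len(code) > 1 and code[1] == "1",
}
chosen = best_question(toy_events, toy_questions)
```

Smoothing each node with its parent on heldout data is a separate step
and is not shown.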

4 Augmenting the Model 

Just as we redefined the speech recognition problem so as to account
for POS tagging and identifying discourse markers, we do the same for
modeling boundary tones and speech repairs. We introduce null tokens
between each pair of consecutive words $W_{i-1}$ and $W_i$ (Heeman and
Allen, 1994), which will be tagged as to the occurrence of these
events. The boundary tone tag $T_i$ indicates if word $W_{i-1}$ ends
an intonational phrase ($T_i$=T), or not ($T_i$=null).

Figure 2: Cross Serial Correspondences

For detecting speech repairs, we have the problem that repairs are
often accompanied by an editing term, such as "um", "uh", "okay", or
"well", and these must be identified as such. Furthermore, an editing
term might be composed of a number of words, such as "let's see" or
"uh well". Hence we use two tags: an editing term tag $E_i$ and a
repair tag $R_i$. The editing term tag indicates if $W_i$ starts an
editing term ($E_i$=Push), if $W_i$ continues an editing term
($E_i$=ET), if $W_{i-1}$ ends an editing term ($E_i$=Pop), or
otherwise ($E_i$=null). The repair tag $R_i$ indicates whether word
$W_i$ is the onset of the alteration of a fresh start ($R_i$=C), a
modification repair ($R_i$=M), or an abridged repair ($R_i$=A), or
there is not a repair ($R_i$=null). Note that for repairs with an
editing term, the repair is tagged after the extent of the editing
term has been determined. Below we give an example showing all
non-null tone, editing term and repair tags.

Example 7 (d93-18.1 utt47) 

it takes one Push you ET know Pop M two hours T 

If a modification repair or fresh start occurs, we need to determine
the extent (or the onset) of the reparandum, which we refer to as
correcting the speech repair. Often, speech repairs have strong word
correspondences between the reparandum and alteration, involving word
matches and word replacements. Hence, knowing the extent of the
reparandum means that we can use the reparandum to predict the words
(and their POS tags) that make up the alteration. For
$R_i \in \{\text{Mod}, \text{Can}\}$, we define $O_i$ to indicate the
onset of the reparandum.^2

If we are in the midst of processing a repair, we need to determine if
there is a word correspondence from the reparandum to the current word
$W_i$. The tag $L_i$ is used to indicate which word in the reparandum
is licensing the correspondence. Word correspondences tend to exhibit
a cross serial dependency; in other words, if we have a correspondence
between $w_j$ in the reparandum and $w_k$ in the alteration, any
correspondence with a word in the alteration after $w_k$ will be to a
word that is after $w_j$, as illustrated in Figure 2. This means that
if $w_i$ involves a word correspondence, it will most likely be with a
word that follows the last word in the reparandum that has a word
correspondence. Hence, we restrict $L_i$ to only those words that are
after the last word in the reparandum that has a correspondence (or
from the reparandum onset if there is not yet a correspondence). If
there is no word correspondence for $w_i$, we set $L_i$ to the first
word after the last correspondence.

^2 Rather than estimate $O_i$ directly, we instead query each
potential onset to see how likely it is to be the actual onset of the
reparandum.
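The restriction on the licensing tag can be sketched in a few lines.
This is an illustrative reading of the paragraph above, not the
authors' implementation; the function name and the use of 0-based
positions relative to the reparandum onset are assumptions.

```python
def licensing_candidates(reparandum_len, last_corr=None):
    """Positions in the reparandum that the licensing tag may point to.

    Positions are 0-based from the reparandum onset (an assumed convention).
    last_corr is the position of the last reparandum word that already has
    a correspondence, or None if there is not yet a correspondence.
    """
    start = 0 if last_corr is None else last_corr + 1
    return list(range(start, reparandum_len))


# Reparandum of Example 6: "engine two from Elmi(ra)-" (4 words).
before_any_match = licensing_candidates(4)               # all four words
after_engine_matched = licensing_candidates(4, last_corr=0)
```

Because of the cross serial pattern, the candidate set only shrinks as
the alteration is processed, which keeps the number of alternatives
for each word small.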

The second tag involved in the correspondences is $C_i$, which
indicates the type of correspondence between the word indicated by
$L_i$ and the current word $W_i$. We focus on word correspondences
that involve either a word match ($C_i$=m), a word replacement
($C_i$=r), where both words are of the same POS tag, or no
correspondence ($C_i$=x).
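The three correspondence types can be illustrated with a small
classifier over (word, POS) pairs. This is a sketch under assumptions:
a match is taken to mean identical word strings, and the POS tags
attached to the words of Example 6 below are chosen for illustration
only.

```python
def correspondence_type(reparandum_item, current_item):
    """Classify a correspondence between two (word, pos) pairs.

    Returns "m" for a word match, "r" for a replacement (same POS tag,
    different word), and "x" for no correspondence.
    """
    rep_word, rep_pos = reparandum_item
    cur_word, cur_pos = current_item
    if rep_word == cur_word:
        return "m"
    if rep_pos == cur_pos:
        return "r"
    return "x"


# Pairs from Example 6, with assumed POS tags.
match = correspondence_type(("engine", "NN"), ("engine", "NN"))
replacement = correspondence_type(("two", "CD"), ("three", "CD"))
none = correspondence_type(("two", "CD"), ("from", "IN"))
```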

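Now that we have defined these six additional tags for modeling
boundary tones and speech repairs, we redefine the speech recognition
problem so that its goal is to find the maximal assignment for the
words as well as the POS, boundary tone, and speech repair tags.

\[
\hat{W}\hat{P}\hat{C}\hat{L}\hat{O}\hat{R}\hat{E}\hat{T}
  = \arg\max_{WPCLORET} \Pr(WPCLORET \mid A)
\]

The result is that we now have eight probability distributions that we
need to estimate.

\[
\begin{aligned}
&\Pr(T_i \mid W_{1,i-1}P_{1,i-1}C_{1,i-1}L_{1,i-1}O_{1,i-1}R_{1,i-1}E_{1,i-1}T_{1,i-1}) \\
&\Pr(E_i \mid W_{1,i-1}P_{1,i-1}C_{1,i-1}L_{1,i-1}O_{1,i-1}R_{1,i-1}E_{1,i-1}T_{1,i}) \\
&\Pr(R_i \mid W_{1,i-1}P_{1,i-1}C_{1,i-1}L_{1,i-1}O_{1,i-1}R_{1,i-1}E_{1,i}T_{1,i}) \\
&\Pr(O_i \mid W_{1,i-1}P_{1,i-1}C_{1,i-1}L_{1,i-1}O_{1,i-1}R_{1,i}E_{1,i}T_{1,i}) \\
&\Pr(L_i \mid W_{1,i-1}P_{1,i-1}C_{1,i-1}L_{1,i-1}O_{1,i}R_{1,i}E_{1,i}T_{1,i}) \\
&\Pr(C_i \mid W_{1,i-1}P_{1,i-1}C_{1,i-1}L_{1,i}O_{1,i}R_{1,i}E_{1,i}T_{1,i}) \\
&\Pr(P_i \mid W_{1,i-1}P_{1,i-1}C_{1,i}L_{1,i}O_{1,i}R_{1,i}E_{1,i}T_{1,i}) \\
&\Pr(W_i \mid W_{1,i-1}P_{1,i}C_{1,i}L_{1,i}O_{1,i}R_{1,i}E_{1,i}T_{1,i})
\end{aligned}
\]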

The context for each of the probability distributions includes all of
the previous context. In principle, we could give all of this context
to the decision tree algorithm and let it decide what information is
relevant in constructing equivalence classes of the contexts. However,
the amount of training data is limited (as are the learning
techniques) and so we need to encode the context in order to simplify
the task of constructing meaningful equivalence classes. We start with
the words and their POS tags that are in the context and for each
non-null tone, editing term (we also skip over E=ET), and repair tag,
we insert it into the appropriate place, just as Kompe et al. (1994)
do for boundary tones in their language model. Below we give the
encoded context for the word "know" from Example 7.

Example 8 (d93-18.1 utt47) 

it/PRP takes/VBP one/CD Push you/PRP 
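The encoding of Example 8 can be reproduced with a minimal sketch. The
data layout (one dict per word position, holding the tags that precede
that word) and the function name encode_context are assumptions made
for illustration; the POS tag given to "know" is also assumed and does
not affect the output.

```python
def encode_context(positions, i):
    """Encode the context available when predicting word i.

    Non-null tone, editing term and repair tags are inserted as if they
    were lexical items; the editing term value "ET" is skipped.
    """
    encoded = []
    for j, pos in enumerate(positions[: i + 1]):
        for key in ("tone", "edit", "repair"):   # order in which tags are predicted
            tag = pos.get(key)
            if tag is not None and tag != "ET":
                encoded.append(tag)
        if j < i:                                 # word i itself is what we predict
            encoded.append(f"{pos['word']}/{pos['pos']}")
    return " ".join(encoded)


# Start of Example 7: "it takes one Push you ET know ..."
example7 = [
    {"word": "it", "pos": "PRP"},
    {"word": "takes", "pos": "VBP"},
    {"word": "one", "pos": "CD"},
    {"word": "you", "pos": "PRP", "edit": "Push"},
    {"word": "know", "pos": "VBP", "edit": "ET"},   # POS tag assumed
]
context_for_know = encode_context(example7, 4)
```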



The result of this is that the non-null tag values are treated just as
if they were lexical items.^3 Furthermore, if an editing term is
completed, or the extent of a repair is known, we can also clean up
the editing term or reparandum, respectively, in the same way that
Stolcke and Shriberg (1996b) clean up filled pauses and simple repair
patterns. This means that we can then generalize between fluent speech
and instances that have a repair. For instance, in the two examples
below, the context for the word "get" and its POS tag will be the same
for both, namely "so/CC_D we/PRP need/VBP to/TO".

Example 9 (d93-11.1 utt46)

so we need to get the three tankers 

Example 10 (d92a-2.2 utt6) 

so we need to Push um Pop A get a tanker of OJ 
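The clean-up that makes Examples 9 and 10 share a context can be
sketched as follows. This is an illustration only: the token list
format and the function name are hypothetical, and dropping the
abridged-repair tag A together with the editing term is an assumption
inferred from the stated context "so/CC_D we/PRP need/VBP to/TO".

```python
def clean_editing_terms(encoded):
    """Remove completed editing terms (Push ... Pop) from an encoded context.

    An editing term that has not yet been closed by Pop is left untouched.
    """
    cleaned = []
    i = 0
    while i < len(encoded):
        if encoded[i] == "Push" and "Pop" in encoded[i:]:
            i = encoded.index("Pop", i) + 1       # skip the completed editing term
            if i < len(encoded) and encoded[i] == "A":
                i += 1                            # assumed: drop the abridged-repair tag too
            continue
        cleaned.append(encoded[i])
        i += 1
    return cleaned


fluent = ["so/CC_D", "we/PRP", "need/VBP", "to/TO"]               # Example 9
with_repair = fluent + ["Push", "um/UH_FP", "Pop", "A"]           # Example 10
cleaned_repair = clean_editing_terms(with_repair)
```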

We also include other features of the context. For instance, we
include a variable to indicate if we are currently processing an
editing term, and whether a non-filled pause editing term was seen.
For estimating $R_i$, we include the editing terms as well. For
estimating $O_i$, we include whether the proposed reparandum includes
discourse markers, filled pauses that are not part of an editing term,
boundary tones, and whether the proposed reparandum overlaps with any
previous repair.

5 Silences 

Silence, as well as other acoustic information, can also give evidence
as to whether an intonational phrase, speech repair, or editing term
occurred. We include $S_i$, the silence duration between word
$W_{i-1}$ and $W_i$, as part of the context for conditioning the
probability distributions for the tone $T_i$, editing term $E_i$, and
repair $R_i$ tags. Due to sparseness of data, we make several
independence assumptions so that we can separate the silence
information from the rest of the context. For example, for the tone
tag, let $Rest_i$ represent the rest of the context that is used to
condition $T_i$. By assuming that $Rest_i$ and $S_i$ are independent,
and are independent given $T_i$, we can rewrite
$\Pr(T_i \mid S_i\,Rest_i)$ as follows.

\[
\Pr(T_i \mid S_i\,Rest_i)
  = \Pr(T_i \mid Rest_i)\,\frac{\Pr(T_i \mid S_i)}{\Pr(T_i)}
\]

We can now use $\Pr(T_i \mid S_i)/\Pr(T_i)$ as a factor to modify the
tone probability in order to take into account the silence duration.
In Figure 3, we give the factors by which we adjust the tag
probabilities given the amount of silence. Again, due to sparseness of
data, we collapse the values of the tone, editing term and repair tag
into six classes: boundary tones, editing term pushes, editing term
pops, modification repairs and fresh starts (without an editing term).
From the figure, we see that if there is no silence between $W_{i-1}$
and $W_i$, the null interpretation for the tone, repair and editing
term tags is preferred. Since the independence assumptions that we
have to make are too strong, we normalize the adjusted tone, editing
term and repair tag probabilities to ensure that they sum to one over
all of the values of the tags.

^3 Since we treat the non-null tags as lexical items, we associate a
unique POS tag with each value.

Table 2: POS Tagging and Perplexity Results
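The adjustment and renormalization described in this section can be
sketched as below. The function name and all numbers are hypothetical;
in particular, the factor values stand in for the silence factors
$\Pr(T_i \mid S_i)/\Pr(T_i)$ and are not taken from the paper.

```python
def adjust_for_silence(context_probs, silence_factor):
    """Scale each tag probability by its silence factor, then renormalize.

    context_probs: {tag: Pr(tag | rest of context)}
    silence_factor: {tag: Pr(tag | silence) / Pr(tag)}; missing tags use 1.0
    """
    adjusted = {tag: p * silence_factor.get(tag, 1.0)
                for tag, p in context_probs.items()}
    total = sum(adjusted.values())
    return {tag: p / total for tag, p in adjusted.items()}


# Hypothetical tone probabilities from the decision tree, and hypothetical
# factors for a long silence (silence makes a boundary tone more likely).
tone_probs = {"null": 0.9, "T": 0.1}
long_silence = {"null": 0.5, "T": 4.0}
tone_given_silence = adjust_for_silence(tone_probs, long_silence)
```

The renormalization is what compensates for the independence
assumptions: after scaling, the values no longer sum to one, so they
are divided by their total.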

6 Example 

To demonstrate how the model works, consider the 
following example. 

Example 11 (d92a-2.1 utt95) 

will take a total of um let's see total of 7 hours 

reparandum T reparandum T 

ip ip 

The language model considers all possible interpre- 
tations (at least those that do not get pruned) and 
assigns a probability to each. Below, we give the 
probabilities for the correct interpretation of the 
word "um" , given the correct interpretation of the 
words "will take a total of" . For reference, we give 
a simplified view of the context that is used for each 
probability. 

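\[
\begin{aligned}
\Pr(T_6=\text{null} \mid \text{a total of}) &= 0.98 \\
\Pr(E_6=\text{Push} \mid \text{a total of}) &= 0.28 \\
\Pr(R_6=\text{null} \mid \text{a total of Push}) &= 1.00 \\
\Pr(P_6=\text{UH\_FP} \mid \text{a total of Push}) &= 0.75 \\
\Pr(W_6=\text{um} \mid \text{a total of Push UH\_FP}) &= 0.33
\end{aligned}
\]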

Given the correct interpretation of the previous 
words, the probability of the filled pause "um" along 
with the correct POS tag, boundary tone tag, and 
repair tags is 0.0665. 
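As a check on how the per-tag probabilities combine, the factors
listed above for "um" can be multiplied together under the chain rule.
The snippet is only an arithmetic illustration; the variable names are
hypothetical. The product of the displayed values is about 0.068,
slightly above the reported 0.0665, presumably because the displayed
factors are rounded.

```python
import math

# Factors for "um" as displayed: tone, editing term, repair, POS tag, word.
um_factors = [0.98, 0.28, 1.00, 0.75, 0.33]

# Chain rule: the joint probability is the product of the conditional factors.
um_joint = math.prod(um_factors)
```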

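Now let's consider predicting the second instance of "total", which is
the first word of the alteration of the first repair, whose editing
term "um let's see", which ends with a boundary tone, has just
finished.

\[
\begin{aligned}
\Pr(T_{10}=\text{T} \mid \text{Push let's see}) &= 0.93 \\
\Pr(E_{10}=\text{Pop} \mid \text{Push let's see Tone}) &= 0.79 \\
\Pr(R_{10}=\text{M} \mid \text{a total of Push let's see Pop}) &= 0.26 \\
\Pr(O_{10}=\text{total} \mid \text{will take a total of } R_{10}=\text{Mod}) &= 0.07 \\
\Pr(L_{10}=\text{total} \mid \text{total of } R_{10}=\text{Mod}) &= 0.94 \\
\Pr(C_{10}=\text{m} \mid \text{will take a } L_{10}=\text{total/NN}) &= 0.87 \\
\Pr(P_{10}=\text{NN} \mid \text{will take a } L_{10}=\text{total/NN } C_{10}=\text{m}) &= 1 \\
\Pr(W_{10}=\text{total} \mid \text{will take a NN } L_{10}=\text{total } C_{10}=\text{m}) &= 1
\end{aligned}
\]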

Given the correct interpretation of the previous 
words, the probability of the word "total" along with 
the correct POS tag, boundary tone tag, and repair 
tags is 0.011. 

7 Results 

To demonstrate our model, we use a 6-fold cross validation procedure,
in which we use each sixth of the corpus for testing data, and the
rest for training data. We start with the word transcriptions of the
Trains corpus, thus allowing us to get a clearer indication of the
performance of our model without having to take into account the poor
performance of speech recognizers on spontaneous speech. All silence
durations are automatically obtained from a word aligner (Entropic,
1994).
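The 6-fold procedure can be sketched as follows. How the corpus was
actually partitioned is not stated; splitting by dialog and assigning
dialogs to folds in an interleaved fashion are assumptions, as are the
function name and the placeholder dialog identifiers.

```python
def six_fold_splits(dialogs, k=6):
    """Yield (train, test) pairs, using each of k folds once as test data."""
    folds = [dialogs[i::k] for i in range(k)]     # interleaved assignment (assumed)
    for i in range(k):
        test = folds[i]
        train = [d for j, fold in enumerate(folds) if j != i for d in fold]
        yield train, test


# Placeholder dialog identifiers, for illustration only.
toy_dialogs = [f"dialog{n:02d}" for n in range(12)]
splits = list(six_fold_splits(toy_dialogs))
```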

Table 3: Detecting Intonational Phrases

Table 4: Detecting and Correcting Speech Repairs

Table 2 shows how POS tagging, discourse marker identification and
perplexity benefit by modeling the speaker's utterance. The POS
tagging results are reported as the percentage of words that were
assigned the wrong tag. The detection of discourse markers is reported
using recall and precision. The recall rate of X is the number of X
events that were correctly determined by the algorithm over the number
of occurrences of X. The precision rate is the number of X events that
were correctly determined over the number of times that the algorithm
guessed X. The error rate is the number of X events that the algorithm
missed plus the number of X events that it incorrectly guessed as
occurring over the number of X events. The last measure is perplexity,
which is a way of measuring how well the language model is able to
predict the next word. The perplexity of a test set of N words
$w_{1,N}$ is calculated as follows.

\[
2^{-\frac{1}{N}\sum_{i=1}^{N} \log_2 \Pr(w_i \mid w_{1,i-1})}
\]
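The evaluation measures defined above can be written out directly. The
function names and the counts in the usage lines are hypothetical and
serve only to show the arithmetic.

```python
import math


def recall_precision_error(hits, misses, false_alarms):
    """Recall, precision and error rate for an event type X.

    hits: X events correctly determined; misses: X events not found;
    false_alarms: times X was guessed where it did not occur.
    """
    occurrences = hits + misses
    recall = hits / occurrences
    precision = hits / (hits + false_alarms)
    error_rate = (misses + false_alarms) / occurrences
    return recall, precision, error_rate


def perplexity(word_probs):
    """Perplexity of a test set given Pr(w_i | w_1..w_{i-1}) for each word."""
    n = len(word_probs)
    return 2 ** (-sum(math.log2(p) for p in word_probs) / n)


# Hypothetical counts and probabilities.
recall, precision, error_rate = recall_precision_error(80, 20, 10)
toy_perplexity = perplexity([0.25, 0.25, 0.25, 0.25])
```

Note that the error rate can exceed 100%, since false alarms are
counted against the number of actual occurrences.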

The second column of Table 2 gives the results of the POS-based model,
the third column gives the results of incorporating the detection and
correction of speech repairs and detection of intonational phrase
boundary tones, and the fourth column gives the results of adding in
silence information. As can be seen, modeling the user's utterances
improves POS tagging, identification of discourse markers, and word
perplexity, with the POS error rate decreasing by 3.1% and perplexity
by 5.3%. Furthermore, adding in silence information to help detect the
boundary tones and speech repairs results in a further improvement,
with the overall POS tagging error rate decreasing by 8.6% and
perplexity by 7.8%. In contrast, a word-based trigram backoff model
(Katz, 1987) built with the CMU statistical language modeling toolkit
(Rosenfeld, 1995) achieved a perplexity of 26.13. Thus our full
language model results in a 14.1% reduction in perplexity.

Table 3 gives the results of detecting intonational boundaries. The
second column gives the results of adding the boundary tone detection
to the POS model, the third column adds silence information, and the
fourth column adds speech repair detection and correction. We see that
adding in silence information gives a noticeable improvement in
detecting boundary tones. Furthermore, adding in the speech repair
detection and correction further improves the results of identifying
boundary tones. Hence to detect intonational phrase boundaries in
spontaneous speech, one should also model speech repairs.

Table 4 gives the results of detecting and correcting speech repairs.
The detection results report the number of repairs that were detected,
regardless of whether the type of repair (e.g. modification repair
versus abridged repair) was properly determined. The second column
gives the results of adding speech repair detection to the POS model.
The third column adds in silence information. Unlike the case for
boundary tones, adding silence does not have much of an effect.^4 The
fourth column adds in speech repair correction, and shows that taking
into account the correction gives better detection rates (Heeman,
Loken-Kim, and Allen, 1996). The fifth column adds in boundary tone
detection, which improves both the detection and correction of speech
repairs.

8 Comparison to Other Work 

Comparing the performance of this model to oth- 
ers that have been proposed in the literature is very 
difficult, due to differences in corpora, and different 
input assumptions. However, it is useful to compare 
the different techniques that are used. 



Bear et al. (1992) used a simple pattern matching 
approach on ATIS word transcriptions. They ex- 
clude all turns that have a repair that just consists 
of a filled pause or word fragment. On this subset 
they obtained a correction recall rate of 43% and a 
precision of 50%. 



Nakatani and Hirschberg (1994) examined how 
speech repairs can be detected using a variety of 
information, including acoustic, presence of word 
matchings, and POS tags. Using these clues they 
were able to train a decision tree which achieved a 
recall rate of 86.1% and a precision of 92.1% on a set 
of turns in which each turn contained at least one 
speech repair. 



^4 Silence has a bigger effect on detection and correction if boundary
tones are modeled.



Stolcke and Shriberg (1996b) examined whether perplexity can be
improved by modeling simple types of speech repairs in a language
model. They find that doing so actually makes perplexity worse, and
they attribute this to not having a linguistic segmentation available,
which would help in modeling filled pauses. We feel that speech repair
modeling must be combined with detecting utterance boundaries and
discourse markers, and should take advantage of acoustic information.

For detecting boundary tones, the model of Wightman and Ostendorf
(1994) achieves a recall rate of 78.1% and a precision of 76.8%. Their
better performance is partly attributed to richer (speaker dependent)
acoustic modeling, including phoneme duration, energy, and pitch.
However, their model was trained and tested on professionally read
speech, rather than spontaneous speech.

Wang and Hirschberg (1992) did employ spontaneous
speech, namely, the ATIS corpus. For turn-
internal boundary tones, they achieved a recall rate 
of 38.5% and a precision of 72.9% using a decision 
tree approach that combined both textual features, 
such as POS tags, and syntactic constituents with 
intonational features. One explanation for the differ- 
ence in performance was that our model was trained 
on approximately ten times as much data. Secondly, 
their decision trees are used to classify each data 
point independently of the next, whereas we find 
the best interpretation over the entire turn, and in- 
corporate speech repairs. 

The models of Kompe et al. (1994) and Mast et al. (1996) are the most
similar to our model in terms of incorporating a language model. Mast
et al. achieve a recall rate of 85.0% and a precision of 53.1% on
identifying dialog acts in a German corpus. Their model employs richer
acoustic modeling; however, it does not account for other aspects of
utterance modeling, such as speech repairs.

9 Conclusion 

In this paper, we have shown that the problems 
of identifying intonational boundaries and discourse 
markers, and resolving speech repairs can be tack- 
led by a statistical language model, which uses lo- 
cal context. We have also shown that these tasks, 
along with POS tagging, should be resolved to- 
gether. Since our model can give a probability esti- 
mate for the next word, it can be used as the lan- 
guage model for a speech recognizer. In terms of 
perplexity, our model gives a 14% improvement over 
word-based language models. Part of this improve- 
ment is due to being able to exploit silence durations, 
which traditional word-based language models tend to ignore.
Our next step is to incorporate this model
into a speech recognizer in order to validate that the 
improved perplexity does in fact lead to a better 
word recognition rate. 